Abstract


Phishing, a crime that exploits technological tools to steal sensitive consumer data, is a major concern on the Internet today, and losses from phishing are rising quickly. Feature engineering is important in solutions for detecting phishing websites; however, the precision of detection is crucial, and it depends on the features that are already known. Additionally, although features retrieved from multiple dimensions are more thorough, extracting them has the downside of taking a long time. To address these issues, we propose an approach in which a dataset containing millions of URLs is used to identify the URLs that have been attacked by phishers. To determine whether a URL has been targeted by a phisher, two deep learning models (CNN-LSTM and CNN BI-LSTM) and two machine learning algorithms (Logistic Regression and XG Boost) are trained on the dataset. The methods are compared using sensitivity, specificity, precision, recall, and F1-score, along with accuracy graphs, confusion matrices, and ROC-AUC curves.

Keywords: Convolutional Neural Network (CNN); Long Short-Term Memory; Machine Learning


1. Introduction

The Internet has developed into a vital infrastructure that profoundly aids human society. However, Internet users' finances are already seriously threatened by phishing, harmful software, and privacy disclosures, which are unavoidable security challenges. The APWG (Anti-Phishing Working Group) (Yang, Zhao, and Zeng) describes phishing as a criminal tactic that combines technological and social engineering deception to obtain users' personal information and login credentials for bank accounts.

Phishing is the practice of attempting to obtain sensitive data (Bhavani et al.), such as usernames, passwords, and payment card details, by posing as a trustworthy party in an electronic communication, often for malicious motives and, indirectly, for money. Because using bait to try to catch a victim is analogous to fishing, the word was formed as a homophone of fishing. The two most common phishing techniques are email spoofing and instant messaging, which regularly persuade people to divulge personal information on a false website that looks and functions exactly like the real one. Victims are frequently duped by communications that pose as coming from social networking websites, auction sites, banks, online payment processors, or IT managers. Links in phishing emails may lead to sites that have been infected with malware. Phishing is a type of social engineering approach that deceives consumers by taking advantage of flaws in current online security (Krishna et al.). Legislation, user education, public awareness campaigns, technical security measures, and other methods are being put into place to combat the rise in reported phishing incidents. However, such measures should be clearly branded as to who made them, and users should not use them without permission. Numerous websites have created additional tools for programmers, including game maps.

In general, phishing is a type of cyberattack (Anil et al.; Kumar et al.) that harms people by deceiving them into exposing private information such as account passwords, bank details, and ATM PINs. Protecting sensitive documents while guarding against malware and web phishing is difficult. Techniques for detecting phishing websites can mainly be categorized into four groups, which are shown in Figure 1.


FIGURE 1. Types of Phishing Website Detection Methods


Figure 2 shows the block diagram of a phishing attack, in which six steps are involved in stealing the user's credentials. In the first step, the attacker prepares a server to host the phishing script and page. In the second step, the attacker compromises a website and injects a malicious script into it. In the third step, the user accesses the compromised website and runs the injected script. In the fourth step, the injected script loads the actual phishing script from the attacker's server. In the fifth step, the phishing script shows a fake login form [9] on the user's screen. In the last step, the victim enters credentials on the fake form, and the credentials are sent to the attacker.

FIGURE 2. Block Diagram of Phishing Process


Classification models are frequently trained on a large number of features, which increases the dimension of the search space. The Hughes effect (Dharani et al.), also referred to as the curse of dimensionality, states that a classifier's performance increases steadily only up to a particular threshold dimensionality, after which it falls. A feature selection approach must be used to resolve this problem.

Because traditional machine learning algorithms are prone to underfitting and overfitting, they may not produce the best results. This issue might be solved using ensemble machine learning approaches and deep learning techniques.

Deep learning is a subfield of machine learning, which is itself a branch of artificial intelligence. Deep learning is effective because its neural networks are loosely modelled on how the human brain functions, and in deep learning the features are learned from data rather than explicitly coded.

A convolutional neural network is a specific kind of neural network that is frequently employed in object recognition, image classification, and image clustering. Deep neural networks (DNNs) enable the construction of hierarchical visual representations. Deep convolutional neural networks are often recommended when high accuracy is required.


2. Literature Survey 


This section surveys previous studies on determining whether a website has been attacked by phishers.

• Lizhen Tang and Qusay H. Mahmoud (2022)

The framework proposed by Lizhen Tang and Qusay H. Mahmoud is implemented as a browser plug-in that can identify phishing risks in real time when a user visits a website and issue a warning. The real-time prediction service integrates several techniques, such as whitelist filtering, blacklist interception, and machine learning (ML) prediction. In the ML prediction module, they compared various machine learning models on various datasets. According to the experimental results, the RNN-GRU model has the highest accuracy, 99.18%, which demonstrates the viability of the proposed approach. (Tang and Mahmoud)

• Peng Yang, Guangzhen Zhao, and Peng Zeng (2019)

They proposed a multidimensional feature phishing detection approach based on a fast detection method that uses deep learning to address these constraints. They combine the rapid classification output of deep learning with URL statistical features, webpage code features, and webpage text features into multidimensional features. Test results on a dataset with millions of legitimate and phishing URLs show that the accuracy is 98.99% and the false positive rate is only 0.59%. (Yang, Zhao, and Zeng)

• Rishikesh Mahajan and Irfan Siddavatam (2018)

This article uses machine learning to distinguish between legitimate and phishing URLs. It extracts and analyses many features of both types of URLs. Algorithms such as Support Vector Machine, Decision Tree, and Random Forest are used to identify phishing websites. By evaluating each algorithm's accuracy, false positive rate, and false negative rate, the study identifies the best machine learning method for detecting phishing URLs: the Random Forest algorithm, with the highest accuracy of 97.4%. (Mahajan and Siddavatam)

• Arathi Krishna V, Anusree A, Blessy Jose, Karthika Anilkumar, and Ojus Thomas Lee (2021)

Various machine learning techniques are used to identify phishing URLs, that is, to categorise a URL as phishing or legitimate. This paper reviews several of the machine learning techniques utilised for this purpose. The objective is to establish a survey resource that helps academics learn about recent advancements in the field and develop phishing detection models that produce more reliable findings. (Krishna et al.)

• Dr Anil Gn, G Om Prakash, K Harsha Manoj, M Lokesh, and Madhusudhan K (2020)

They created resource descriptions in which a combination of methods is used to detect phishing websites. Supervised learning approaches are used to train their programme, and the approach achieves a very encouraging score. They also employed a software programme to extract features that allow quantifying the frequency of each job within the dataset, in addition to a random forest classifier to handle incomplete data sets. In tests of the effectiveness of the Random Forest algorithm and ensemble learning techniques, the accuracy is impressive. (Anil et al.)

• Naresh Kumar D, Nemala Sai Rama Hemanth, Premnath S, Nishanth Kumar V, and Uma S (2020)

This study proposes a machine learning-based classification technique with heuristic features, where the features may be selected from properties such as the Uniform Resource Locator, source code, session, and so on. Five machine learning techniques (random forest, K-Nearest Neighbour, decision tree, support vector machine, and logistic regression) were used to assess the suggested model. The random forest approach outperforms existing models, detecting attacks with an accuracy of 91.4%. Moreover, the random forest model chooses the best data using orthogonal and oblique classifiers. (Kumar et al.)


3. Methodology 
3.1. Dataset Description: 


The model's training data set was obtained from Kaggle.com. It has more than 25 thousand data entries and 48 attributes (Prabakaran, Chandrasekar, and Sundaram; Elsadig et al.). 80 percent of the entries are used as training data, and 20 percent as test data.


3.2. Data Preprocessing: 


Data preprocessing is the process of translating raw data into information that a machine learning model can use. It is both the first and the most critical phase in building a machine learning model. Preprocessing data entails the following actions:


3.2.1. Getting the Dataset: 


Data is the foundation of all machine learning models; thus, the first thing needed to develop one is a dataset. A dataset is a collection of data that has been properly arranged for a particular problem. For instance, the dataset needed to build a business-oriented machine learning model will be different from the dataset needed for another purpose (Mahajan and Siddavatam). Datasets can take several forms and serve a variety of functions. In this project, we use a dataset in .csv format.


3.2.2. Importing Libraries: 


To perform data preprocessing in Python, we must import a number of predefined Python packages. Each of these libraries is used to complete a particular task.


3.2.3. Importing the Datasets: 


In order to achieve the required results for our research, we must import the dataset in this phase. Before importing a dataset, we must first make the current directory the working directory.

3.2.4. Splitting the Dataset into Training Set and Testing Set:

We divide our data into a training set and a test set during the machine learning data preparation stage. This is one of the most important steps in data preparation because it allows us to improve the performance of our machine learning model.


FIGURE 3. Splitting the Dataset (the dataset is divided into a training set and a test set)
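The 80:20 split described in Section 3.1 can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the function name `train_test_split_rows`, the fixed seed, and the toy rows are assumptions made for the example and are not taken from the original experiments.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for the dataset: (feature_vector, label) pairs (hypothetical values).
rows = [([i, i % 3], i % 2) for i in range(100)]
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set that only contains the last rows of the file, which matters when a dataset is stored sorted by label.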


3.3. Various Algorithms Used 


Here, we employ four techniques: two machine learning algorithms and two deep learning models based on convolutional neural networks.


3.3.1. Machine Learning Algorithms 


Logistic Regression: Logistic regression is the regression model used when the dependent variable (DV) is categorical. It is a mathematical technique for estimating the probability of a binary response given one or more independent variables, and it is used to predict outcomes with two alternative values, such as 0 or 1, pass or fail, yes or no, and so forth. Logistic regression is a predictive analysis (Choudhary et al.). It is widely used to describe data and to explain the relationship between a binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables. It also requires a more complex cost function than linear regression, because its hypothesis is not linear: it is passed through the sigmoid function, also called the logistic function. As the equation below shows, this limits the hypothesis to values between 0 and 1. The binary dependent variable in logistic regression takes only the values 0 and 1, which represent outcomes such as Yes/No, True/False, or High/Low.

0 < h(x) < 1
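To illustrate why the hypothesis stays between 0 and 1, the following minimal sketch evaluates the sigmoid of a weighted sum of features. The weights, bias, feature values, and the 0.5 decision threshold are hypothetical and chosen only for the example; they are not parameters of the trained model.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hypothesis(weights, bias, features):
    """h(x) = sigmoid(bias + sum of weight * feature)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Map the probability to class 1 (e.g. phishing) or 0 (e.g. legitimate)."""
    return 1 if hypothesis(weights, bias, features) >= threshold else 0


# Hypothetical parameters and one hypothetical feature vector.
weights, bias = [0.8, -1.2], 0.1
features = [2.0, 0.5]
print(hypothesis(weights, bias, features), predict(weights, bias, features))
```

Mathematically the output never reaches 0 or 1 exactly; in floating-point arithmetic it can saturate to those values for extreme inputs.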


XG Boost: XG Boost is the abbreviation for Extreme Gradient Boosting. It is an implementation of gradient-boosted decision trees developed with efficiency and speed in mind. Boosting is a type of ensemble learning that adds new models to correct the errors of previous models. Models are added sequentially until there is no more opportunity for improvement (Bhavani et al.). To minimise the loss when adding new models, it employs a gradient descent technique. This approach makes efficient use of memory and computing time and aims to train the model with the fewest resources possible. Model performance and execution speed are XG Boost's two key benefits.


FIGURE 4. Architecture of XG Boost
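The boosting idea, in which each new model corrects the errors of the models before it, can be sketched with plain gradient boosting on one-feature decision stumps. This is not the XG Boost library and not the configuration used in the experiments: XG Boost additionally uses regularisation and second-order gradient information, which are omitted here. The function names, the learning rate, the number of rounds, and the toy data are all assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def boosted_predict(model, x):
    total = model["base"]
    for stump in model["stumps"]:
        total += model["learning_rate"] * stump_predict(stump, x)
    return total


# Toy data (hypothetical): label is 1 when the feature exceeds 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_boosted(xs, ys)
print(round(boosted_predict(model, 2), 3), round(boosted_predict(model, 7), 3))
```

Each round shrinks the remaining residual by the learning rate, so the ensemble approaches the labels gradually instead of fitting them in a single step.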


3.3.2. Deep Learning Algorithms: 


Deep neural networks such as convolutional neural networks (CNNs) are frequently used to analyse visual imagery. When we think of neural networks, we frequently think of matrix multiplications, but a CNN instead uses an operation called convolution. In mathematics, convolution is an operation on two functions that produces a third function, which expresses how the shape of one function is modified by the other.
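As a concrete instance of the operation, the sketch below implements discrete one-dimensional convolution in plain Python, together with the sliding-window variant (cross-correlation, with no kernel flip) that convolutional layers in deep learning libraries typically compute. The function names and the sample signal and kernels are assumptions for illustration.

```python
def convolve_full(signal, kernel):
    """Discrete convolution: out[n] = sum over k of signal[k] * kernel[n - k]."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def cross_correlate_valid(signal, kernel):
    """Slide the kernel over the signal without flipping it, keeping only
    positions where the kernel fits entirely inside the signal."""
    n_positions = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n_positions)]


print(convolve_full([1, 2, 3], [0, 1, 0.5]))
print(cross_correlate_valid([1, 2, 3, 4], [1, 0, -1]))
```

In a CNN the kernel values are learned during training, so whether the kernel is flipped makes no practical difference to what the layer can represent.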


FIGURE 5. Architecture of CNN (input layer, hidden layer 1, hidden layer 2)


CNN-LSTM (Convolutional Neural Network-Long Short-Term Memory): Because CNN and LSTM are both readily available, combining them is a common way to merge their benefits. This work integrates CNN and LSTM into a unique deep learning scheme. Two CNN layers were used to make sure the multidimensional data was appropriately correlated and collected (Tang and Mahmoud; Yang, Zhao, and Zeng). The LSTM layer received the sequence of features produced by the CNN layers as input and extracted time dependencies in greater detail. The URL input matrix alone is insufficient to represent the information on a phishing website. Therefore, multidimensional features that describe the website more thoroughly are generated by combining the CNN-LSTM URL features, web page code features, web page text features, and the rapid classification result. Phishers typically create phishing URLs by imitating the URL of a legitimate website in an effort to confuse users.


FIGURE 6. CNN-LSTM

CNN BI-LSTM (CNN Bidirectional Long Short-Term Memory): Bidirectional long short-term memory is a kind of recurrent neural network. It consists of two hidden layers that process the data in the forward and reverse directions, enabling the structure to remember information from earlier input (Tang and Mahmoud; Yang, Zhao, and Zeng). In our proposed architecture, it is the second layer, and its purpose is to keep track of previous transactions that are useful for forecasting the output y, which may be stated as follows:

y^t = g(w [h^t, c^t] + b)

where w is the weight applied to the concatenation of the hidden and current states produced by the Bi-LSTM, h and c are the hidden and current states, b is the bias term, g is the activation function, and t denotes the transaction.


3.4. Proposed Framework 


The proposed system has two phases, the classification phase and the phishing detection phase. The proposed framework is depicted in Figure 7.


FIGURE 7. Proposed Framework for Phishing Attack (blocks: User Input, Dataset, Pre-processing, Phishing Attack Detection Module, Detected URL)


3.4.1. Classification Phase: 


Legitimate URLs and suspected phishing URLs make up the input for the classification phase. These inputs are passed to three submodules: the Data Collection module, the Feature Selection module, and the Classification module. The feature extraction module considers address bar-based features, abnormal-based features, and domain-based features. These attributes are provided as input to the classification module. The classification module's main goal is to accurately detect phishing websites by contrasting their URLs with those of legitimate websites. So that the classifier can successfully identify phishing sites, the primary goal of feature selection is to extract the relevant and necessary features from the attributes offered by the feature extraction module. The proposed study comprises two classifiers that use machine learning and two that use deep learning.
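The kind of address bar-based features such a module might compute can be sketched as follows. The specific features, their names, and the example URLs are hypothetical illustrations; they are not the 48 attributes of the dataset used in this study.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(host):
    """Return True when the host part is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def address_bar_features(url):
    """Compute a few illustrative address bar features for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip(host),
        "uses_https": parts.scheme == "https",
    }


# Hypothetical URLs: the first hides an IP-address host behind an "@".
suspicious = address_bar_features("http://example.com@192.0.2.7/login")
ordinary = address_bar_features("https://www.example.com/")
print(suspicious)
print(ordinary)
```

Each dictionary can be turned into one row of a feature table, which is the form of input that the classifiers described above expect.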


3.4.2. Detection Phase: 


The primary goal of this module is to identify phishing URLs in a dataset that includes a large number of URLs, using the features produced by the feature extraction module.


4. Performance Metrics 


4.1. Metrics considered for performance 
evaluation 


In a classification problem, the category or categories of data are determined using training data. The model learns from the provided dataset and then classifies new data into groups or classes in accordance with its training. Its output predicts a class such as Yes or No, 0 or 1, spam or not spam, etc. (Elsadig et al.). The effectiveness of a classification model is measured using a range of metrics, some of which are as follows:


4.1.1. Accuracy


Accuracy is the proportion of correct predictions a model makes out of all its predictions in a classification task. In the formulas below, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives.


Accuracy = (TP + TN)/(TP + FP + FN + TN) (1)


4.1.2. Precision 


Precision is the proportion of positive predictions that are correct (Buber, Demir, and Sahingoz). It is calculated as the ratio of true positives to all positive predictions (true positives plus false positives).


Precision = TP/(TP + FP) (2)


4.1.3. Recall 


Recall quantifies the proportion of actual positives that were correctly predicted as positive. It is calculated as the ratio of true positives to all actual positives, that is, correctly predicted positives plus incorrectly predicted negatives (TP and FN).


Recall (R) = TP/(TP + FN) (3)


4.1.4. F1-Score


The F-score, or F1 Score, is used to assess a binary classification model on the basis of the predictions made for the positive class. It is calculated using Precision and Recall (Al-Ahmadi, Alotaibi, and Alsaleh). The F1 Score is the harmonic mean of recall and precision, which gives equal weight to both.


F1-Score = 2 * (Precision * Recall)/(Precision + Recall) (4)


4.1.5. Confusion Matrix 


A confusion matrix is a tabular representation of the predicted results against the actual results. It is used to show how well a binary classifier performed on a set of test data for which the true values are known.
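The four cells of a binary confusion matrix can be counted directly from the actual and predicted labels, as in the following minimal sketch. The function name and the toy label lists are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN for a binary classifier."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Toy labels (hypothetical): 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```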


4.1.6. Specificity 


Specificity is the proportion of actual negatives (true negatives plus false positives) that the model correctly identifies as negative.


Specificity = TN/(TN + FP) (5)


4.1.7. Sensitivity 


In machine learning, sensitivity is the metric used to evaluate the capacity of a model to predict the true positives of each available category. It is computed in the same way as recall.


Sensitivity = TP/(TP + FN) (6)
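Equations (2) to (6), together with the accuracy formula, can be evaluated together from the four confusion matrix counts. The sketch below uses the counts shown in the Logistic Regression confusion matrix (Figure 9) and assumes that its rows are the actual classes and its columns the predicted classes, in the order 0, 1; that layout is an assumption, as are the function and key names.

```python
def classification_metrics(tp, fp, fn, tn):
    """Evaluate the metrics of Section 4.1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,  # sensitivity and recall are the same ratio
        "specificity": specificity,
        "f1": f1,
    }


# Counts read from Figure 9, under the assumed layout [[TN, FP], [FN, TP]].
metrics = classification_metrics(tp=2795, fp=287, fn=123, tn=1889)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The guards against empty denominators return 0.0 by convention, so the function does not fail when a class is never predicted or never occurs.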


5. Output of Experiments and Analysis 


The dataset is split 80:20 into training and testing portions. This section summarises the models that were applied in our proposed framework. Table 1 displays the accuracy of the various algorithms we used, and Table 2 displays the metrics of the two algorithms that were compared, Logistic Regression and XG Boost. Figure 8 shows the accuracy of all the algorithms used in this project, where the x-axis shows the algorithms and the y-axis shows the accuracy percentage achieved. Figures 9 and 10 show the confusion matrix and ROC curve achieved by the Logistic Regression algorithm. Figures 11 and 12 show the confusion matrix and ROC curve of the XG Boost algorithm, and Figure 13 compares the two ROC curves.
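The area under a ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following minimal sketch computes it by comparing every positive-negative pair; the function name and the toy scores are assumptions, and the pairwise loop is meant for illustration rather than for large test sets.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy scores (hypothetical): predicted probability of the phishing class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```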
TABLE 1. Accuracy of different algorithms


Algorithms             Accuracy
CNN LSTM               56.4%
CNN BI-LSTM            55.9%
Logistic Regression    91.6%
XG Boost               92%


TABLE 2. Metrics of two algorithms

Metrics        Logistic Regression    XG Boost
Precision      0.95                   1.00
Recall         0.93                   1.00
F1-Score       0.93                   1.00
Sensitivity    0.9                    1.00
Specificity    0.09                   1.00
ROC-AUC        0.96                   0.95


FIGURE 8. Comparison of four different Algorithms (x-axis: Logit Accuracy, XGB Accuracy, CNN LSTM, CNN BI-LSTM)


Confusion Matrix of Logistic Regression

1889    287
123     2795

FIGURE 9. Confusion Matrix of Logistic Regression


FIGURE 10. ROC Curve for Logistic Regression (x-axis: False Positive Rate)


Confusion Matrix of XGBOOST (rows: Actuals, columns: Predictions)

        0       1
0       2204    0
1       0       2890

FIGURE 11. Confusion Matrix of XG Boost


FIGURE 12. ROC Curve for XG Boost (x-axis: False Positive Rate)


FIGURE 13. Comparison of Logistic Regression and XG Boost ROC Curves (x-axis: False Positive Rate, y-axis: True Positive Rate)


6. Conclusion


In this study, the problem of phishing attacks is considered, and a useful model is presented using the CNN LSTM, CNN-Bi-LSTM, Logistic Regression, and XG Boost algorithms, which combine deep neural networks and machine learning to identify and categorise malicious URLs. The Logistic Regression and XG Boost models detect phishing URLs with far higher accuracy than the widely used LSTM-based models. The model's suitability is demonstrated by the analysis, which yields 92% accuracy along with the other performance metrics. To make this application accessible to everyone, it can be further extended into a website.